Projected Clustering for Huge Data Sets in MapReduce
Authors
Abstract
Fast-growing data sets with a very high number of attributes are becoming common in social, industrial, and scientific areas. A meaningful analysis of these data sets requires sophisticated data mining techniques, such as projected clustering, that are able to deal with such complex data. In this work, we investigate solutions for extending the state-of-the-art projected clustering algorithm P3C to large data sets in high-dimensional spaces. We show that the original model of the P3C algorithm is not suitable for huge data sets. Therefore, we propose the necessary changes to the underlying clustering model and then present an efficient MapReduce-based implementation of our novel P3C-MR algorithm. The effectiveness of the proposed changes on large data sets and the efficiency of the P3C-MR algorithm are comprehensively evaluated on synthetic and real-world data sets. Additionally, we propose the P3C-MR-Light algorithm, a simplified version of P3C-MR that shows extraordinarily good results in terms of runtime and result quality on large data sets. Finally, we compare our solutions to existing approaches.
Similar resources
An Accelerated MapReduce-Based K-prototypes for Big Data
Big data are often characterized by a huge volume and a variety of attributes, namely numerical and categorical. To address this issue, this paper proposes an accelerated MapReduce-based k-prototypes method. The proposed method is based on a pruning strategy that accelerates the clustering process by reducing unnecessary distance computations between cluster centers and data points. Experiments ...
Scalable Data Clustering Using Fermi GPUs on FutureGrid
Scientific applications are creating huge amounts of data. These data sets need to be classified into subsets in order to draw meaningful conclusions. Data clustering is the statistical analysis process that groups similar objects into relatively homogeneous sets, which are called clusters. The computational demands of data clustering grow rapidly. And it is very time consuming for ...
Enhancing Map-Reduce Framework for Bigdata with Hierarchical Clustering
MapReduce is a software framework that allows certain kinds of parallelizable or distributable problems involving large data sets to be solved using computing clusters. This paper introduces our experience of grouping internet users by mining a huge volume of web access logs of up to 500 gigabytes. The application is realized using hierarchical clustering algorithms with Map-Reduce, a parallel p...
MapReduce K-Means based Co-Clustering Approach for Web Page Recommendation System
Co-clustering is one of the data mining techniques used for web usage mining. Co-clustering Web log data is the process of simultaneously categorizing both users and pages. It is used to extract the users' information based on a subset of pages. Nowadays, the cyberspace is filled with a huge volume of data distributed across the world. The business knowledge acquaintance from such a voluminous d...
A Robust Density-Based Clustering Approach Using DBCURE-MapReduce Techniques
Clustering is the process of grouping similar data into the same cluster and dissimilar data into different clusters. Density-based clustering, with algorithms such as DBSCAN and OPTICS, is a useful clustering approach. The increasing volume of data and the varying size of data sets make the clustering process challenging. We therefore propose a parallel clustering framework based on MapReduce. We deve...